Partial N-grams

Author

  • Karl Pfleger
Abstract

This work presents a new model, called a partial n-gram, in which probability estimates are kept for only some patterns from the full joint distribution. The main experimental result shows that a partial n-gram model for one value of n can have better predictive performance than a full n-gram model for a smaller n when the two models have the same number of parameters.

N-grams are among the simplest and most commonly used models in sequential domains and are still considered state of the art for many problems, often outperforming more complex models. They have thus become a cornerstone of many interrelated fields, including statistical natural language processing [4] and speech recognition [2], as well as universal data compression [1]. There are many variations on n-gram models, differing on issues such as whether the model parameters explicitly represent the full joint distribution over n symbols or the conditional distribution of the next symbol given the previous n-1 symbols. In the partial n-gram introduced here, probability estimates (in the form of appropriately smoothed, maximum-likelihood-estimating counts) are kept for some, but not all, possible n-symbol patterns from a known finite alphabet. Specifically, we consider keeping estimates for only the most frequent patterns. In place of counts for the remaining patterns, a uniform distribution over the remaining probability mass is used. Omitting probability estimates decreases predictive performance for fixed n, but for a fixed number of parameters (a fixed memory requirement) partial n-grams can provide better performance.

We performed experiments using book1 from the Calgary compression corpus [1], a Thomas Hardy novel, transformed to a 26-letter alphabet (using letters as the basic symbols, not words). We examined standard predict-the-next-symbol inference, though in general we are interested in arbitrary prediction patterns, such as predicting a middle symbol from context on both sides or simultaneously predicting multiple symbols [5]. (This generality necessitates representing estimates for the full joint distribution rather than the conditional distribution.) Accuracy and entropy were measured on a held-out test set consisting of the last 10,000 characters. Accuracy is the standard predictive accuracy common in machine learning: the proportion of times the correct symbol was predicted by the model. Entropy here means the standard measure referred to as the entropy of the test data given the model, i.e., the cross-entropy. We expected partial n-grams containing the most frequent patterns to perform well on accuracy, for which fine distinctions between infrequent patterns should be of little help, but to perform more poorly under the entropy measure.

Let k-n-gram denote the model that results from batch-training a full n-gram and then discarding all but the k most frequent patterns. We used n = 1 to 4 and found that the partial 26^i-(i+1)-gram achieved higher accuracy than the full i-gram (which likewise has 26^i parameters) in all three cases, i = 1, 2, 3. Surprisingly, the 17576-4-gram also achieved better entropy than the 17576-3-gram (i.e., the full 3-gram), but in the other cases the full n-grams had better entropy scores. All reported scores were obtained after training on the first 500,000 characters, which, based on learning curves, was enough data to achieve asymptotic performance even for width 4.
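
To make the construction concrete, the Python sketch below shows one way a k-n-gram of the kind described above could be built and evaluated, deriving next-symbol predictions from the joint estimates. It is a minimal illustration under stated assumptions, not the paper's implementation: the add-one smoothing, the function names, and the evaluation loop are choices made here for brevity; the paper only specifies that counts are appropriately smoothed and that a uniform distribution covers the probability mass of the discarded patterns.

# Minimal sketch (not the paper's code) of a k-n-gram: keep smoothed
# probability estimates for the k most frequent n-symbol patterns and
# spread the remaining probability mass uniformly over all other patterns.
# Add-one smoothing is an assumption made for illustration only.
from collections import Counter
from math import log2

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # the 26-letter alphabet used in the paper

def train_k_n_gram(text, n, k, alpha=1.0):
    """Batch-train a full n-gram, then discard all but the k most frequent patterns."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    num_patterns = len(ALPHABET) ** n
    denom = total + alpha * num_patterns            # add-alpha smoothing (assumed)
    kept = {p: (c + alpha) / denom for p, c in counts.most_common(k)}
    other_mass = 1.0 - sum(kept.values())           # mass left for discarded patterns
    other_count = num_patterns - len(kept)
    return kept, other_mass, other_count

def joint_prob(model, pattern):
    """Joint probability of an n-symbol pattern; uniform share if it was discarded."""
    kept, other_mass, other_count = model
    return kept.get(pattern, other_mass / other_count)

def next_symbol_dist(model, context):
    """Conditional distribution over the next symbol, derived from the joint estimates."""
    joints = {c: joint_prob(model, context + c) for c in ALPHABET}
    z = sum(joints.values())
    return {c: p / z for c, p in joints.items()}

def evaluate(model, test, n):
    """Predictive accuracy and cross-entropy (bits per symbol) on held-out text."""
    correct, bits, m = 0, 0.0, 0
    for i in range(n - 1, len(test)):
        dist = next_symbol_dist(model, test[i - n + 1:i])
        actual = test[i]
        correct += actual == max(dist, key=dist.get)
        bits -= log2(dist[actual])
        m += 1
    return correct / m, bits / m

# Example usage on the paper's setup (book1 reduced to 26 letters):
# model = train_k_n_gram(train_text, n=4, k=17576)  # same parameter count as a full 3-gram
# accuracy, cross_entropy = evaluate(model, test_text, n=4)

In this sketch, k = 17576 with n = 4 mirrors the paper's comparison of a partial 4-gram against the full 3-gram at an equal parameter budget.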


Similar articles

Skip N-grams and Ranking Functions for Predicting Script Events

In this paper, we extend current state-of-the-art research on unsupervised acquisition of scripts, that is, stereotypical and frequently observed sequences of events. We design, evaluate and compare different methods for constructing models for script event prediction: given a partial chain of events in a script, predict other events that are likely to belong to the script. Our work aims to answ...


Should Syntactic N-grams Contain Names of Syntactic Relations?

In this paper, we discuss a specific type of mixed syntactic n-grams: syntactic n-grams with relation names, snr-grams. This type of syntactic n-gram combines lexical elements of the sentence with the syntactic data, but it keeps the properties of traditional n-grams and syntactic n-grams. We discuss two possibilities related to the labelling of relation names for snr-grams: based on dependencie...


N-gramas sintácticos no-continuos

In this paper, we present the concept of non-continuous syntactic n-grams. In our previous works we introduced the general concept of syntactic n-grams, i.e., n-grams that are constructed by following paths in syntactic trees. Their great advantage is that they allow the introduction of purely linguistic (syntactic) information into machine learning methods. A certain disadvantage is that previous ...


Comparing word, character, and phoneme n-grams for subjective utterance recognition

In this paper, we compare the performance of classifiers trained using word n-grams, character n-grams, and phoneme n-grams for recognizing subjective utterances in multiparty conversation. We show that there is value in using very shallow linguistic representations, such as character n-grams, for recognizing subjective utterances, in particular, gains in the recall of subjective utterances.


Syntactic Dependency-Based N-grams as Classification Features

In this paper we introduce a concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in the way their elements are chosen as neighbors. In the case of sn-grams, the neighbors are taken by following syntactic relations in syntactic trees, rather than by taking the words as they appear in the text. Dependency trees fit directly into this idea, while in the case of constituency...




Publication date: 2007